
[CELEBORN-796] Support for globally disable thread-local cache in the shared PooledByteBufAllocator #1716

Closed
wants to merge 4 commits into main from cfmcgrady:allow-cache

Conversation

cfmcgrady
Contributor

@cfmcgrady cfmcgrady commented Jul 14, 2023

What changes were proposed in this pull request?

As title

Why are the changes needed?

As title

Does this PR introduce any user-facing change?

Yes, the thread-local cache of the shared `PooledByteBufAllocator` can be disabled by setting `celeborn.network.memory.allocator.allowCache=false`.
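
For reference, this is how the flag would be set in a deployment; a minimal sketch assuming Celeborn's standard `conf/celeborn-defaults.conf` properties file:

```
# Disable Netty's thread-local cache in the shared PooledByteBufAllocator
celeborn.network.memory.allocator.allowCache = false
```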

How was this patch tested?

Pass GA

@cfmcgrady cfmcgrady marked this pull request as draft July 14, 2023 08:50
@cfmcgrady cfmcgrady changed the title [CELEBORN-796] Support for globally disabled thread-local cache in the shared PooledByteBufAllocator [WIP][CELEBORN-796] Support for globally disabled thread-local cache in the shared PooledByteBufAllocator Jul 14, 2023
@codecov

codecov bot commented Jul 14, 2023

Codecov Report

Merging #1716 (176cfb1) into main (efc9a87) will decrease coverage by 0.17%.
Report is 5 commits behind head on main.
The diff coverage is 75.00%.

```diff
@@            Coverage Diff             @@
##             main    #1716      +/-   ##
==========================================
- Coverage   46.86%   46.68%   -0.17%
==========================================
  Files         162      162
  Lines       10027    10088      +61
  Branches      923      929       +6
==========================================
+ Hits         4698     4709      +11
- Misses       5019     5071      +52
+ Partials      310      308       -2
```

| Files Changed | Coverage Δ |
| --- | --- |
| ...pache/celeborn/common/network/util/NettyUtils.java | 37.94% <61.54%> (+8.31%) ⬆️ |
| ...cala/org/apache/celeborn/common/CelebornConf.scala | 87.47% <100.00%> (+0.04%) ⬆️ |

... and 12 files with indirect coverage changes


@pan3793
Member

pan3793 commented Jul 14, 2023

Will disabling the cache hurt performance and increase CPU usage?

@github-actions

github-actions bot commented Aug 4, 2023

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Aug 4, 2023
@pan3793
Member

pan3793 commented Aug 7, 2023

Let's revive this PR; it's beneficial for releasing direct memory.

I hit worker memory issues recently: the worker triggers the trim action, then the disk buffer becomes empty, but the direct memory does not drop back under the threshold, leaving the worker stuck in the PAUSE state forever.

```
MemoryManager#183 - Direct memory usage: 1536.0 MiB/3.0 GiB, disk buffer size: 0.0 B, sort memory size: 0.0 B, read buffer size: 0.0 B
StorageManager#51 - Trigger org.apache.celeborn.service.deploy.worker.storage.StorageManager trim action
StorageManager#51 - Trigger org.apache.celeborn.service.deploy.worker.storage.StorageManager trim action
...
```
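
To make the failure mode concrete, here is a purely illustrative sketch of the pause/resume threshold idea described above; the class name, fields, and ratios are hypothetical and are not Celeborn's actual MemoryManager code:

```java
// Illustrative only: a worker pauses ingestion when direct memory usage crosses a
// pause threshold and resumes only after usage falls back below a resume threshold.
// If Netty's thread-local caches keep holding freed direct buffers, usage never
// drops far enough and the worker stays in the PAUSE state even after trim actions.
public class PauseResumeSketch {
  private final long maxDirectMemory;   // e.g. the 3.0 GiB limit in the log above
  private final double pauseRatio;      // hypothetical, e.g. 0.85
  private final double resumeRatio;     // hypothetical, e.g. 0.50

  public PauseResumeSketch(long maxDirectMemory, double pauseRatio, double resumeRatio) {
    this.maxDirectMemory = maxDirectMemory;
    this.pauseRatio = pauseRatio;
    this.resumeRatio = resumeRatio;
  }

  public boolean shouldPause(long usedDirectMemory) {
    return usedDirectMemory >= (long) (maxDirectMemory * pauseRatio);
  }

  public boolean shouldResume(long usedDirectMemory) {
    return usedDirectMemory <= (long) (maxDirectMemory * resumeRatio);
  }
}
```

If cached buffers keep the reported usage high, `shouldResume` never returns true, which matches the stuck PAUSE behavior described here.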

```
@@ -42,7 +42,8 @@

/** Utilities for creating various Netty constructs based on whether we're using EPOLL or NIO. */
public class NettyUtils {
  private static volatile PooledByteBufAllocator _allocator;
  private static final PooledByteBufAllocator[] _sharedPooledByteBufAllocator =
```

Member

how about splitting it into 2 variables?

Contributor Author

It's derived from Spark and I prefer to keep it

Member

not a big deal, both are fine.
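
For readers following the thread: a minimal sketch of the two-slot shared-allocator idea (index 0 with thread-local caching, index 1 without), built on Netty's `PooledByteBufAllocator` and its `useCacheForAllThreads` flag. The class and method names are illustrative rather than the actual NettyUtils implementation, and the sketch assumes the Netty 4.1.52+ constructor that omits the tiny cache size:

```java
import io.netty.buffer.PooledByteBufAllocator;

public final class SharedAllocatorsSketch {
  // Index 0: thread-local caching allowed; index 1: caching disabled.
  private static final PooledByteBufAllocator[] SHARED = new PooledByteBufAllocator[2];

  public static synchronized PooledByteBufAllocator getSharedAllocator(boolean allowCache) {
    int index = allowCache ? 0 : 1;
    if (SHARED[index] == null) {
      SHARED[index] = new PooledByteBufAllocator(
          true, // preferDirect
          PooledByteBufAllocator.defaultNumHeapArena(),
          PooledByteBufAllocator.defaultNumDirectArena(),
          PooledByteBufAllocator.defaultPageSize(),
          PooledByteBufAllocator.defaultMaxOrder(),
          allowCache ? PooledByteBufAllocator.defaultSmallCacheSize() : 0,
          allowCache ? PooledByteBufAllocator.defaultNormalCacheSize() : 0,
          allowCache); // useCacheForAllThreads
    }
    return SHARED[index];
  }
}
```

Keeping both allocators in one array (as in the diff above) mirrors the Spark-derived style the author mentions, versus splitting them into two named fields as suggested.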

@pan3793
Member

pan3793 commented Aug 7, 2023

cc @waitinfuture

@pan3793 pan3793 removed the stale label Aug 7, 2023
@cfmcgrady cfmcgrady changed the title [WIP][CELEBORN-796] Support for globally disabled thread-local cache in the shared PooledByteBufAllocator [WIP][CELEBORN-796] Support for globally disable thread-local cache in the shared PooledByteBufAllocator Aug 7, 2023
@cfmcgrady cfmcgrady marked this pull request as ready for review August 7, 2023 07:31
@cfmcgrady cfmcgrady changed the title [WIP][CELEBORN-796] Support for globally disable thread-local cache in the shared PooledByteBufAllocator [CELEBORN-796] Support for globally disable thread-local cache in the shared PooledByteBufAllocator Aug 7, 2023
@pan3793
Member

pan3793 commented Aug 8, 2023

I think it's good to go as it does not change the default behavior.

Generally, disabling the cache would hurt performance. Any chance to provide a benchmark with the cache enabled/disabled? e.g. terasort: https://github.com/pan3793/spark-terasort

@pan3793
Member

pan3793 commented Aug 8, 2023

BTW, I rolled this patch out on a small cluster; it helps with memory reduction without any obvious performance drop.

Before (cache enabled)

[screenshots: memory usage with the cache enabled]

After (cache disabled)

[screenshots: memory usage with the cache disabled]

buildConf("celeborn.network.memory.allocator.allowCache")
.categories("network")
.internal
.version("0.4.0")
Member

Maybe 0.3.1, if @waitinfuture does not oppose.

Contributor

@waitinfuture waitinfuture Aug 8, 2023


No opposition:) I think we should merge this PR to 0.3.1

Contributor

The graph looks great!

@cfmcgrady
Contributor Author

The Netty memory grows more slowly than before in our production environment after applying this patch.

Before this PR (cache enabled)

[Screenshot 2023-08-09 11:04:17 AM]

After this PR (cache disabled)

[Screenshot 2023-08-09 11:04:54 AM]

@waitinfuture
Contributor

Thanks! Merging to main/0.3

waitinfuture pushed a commit that referenced this pull request Aug 11, 2023
… shared PooledByteBufAllocator

### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

Yes, the thread local cache of shared `PooledByteBufAllocator` can be disabled by setting `celeborn.network.memory.allocator.allowCache=false`

### How was this patch tested?

Pass GA

Closes #1716 from cfmcgrady/allow-cache.

Authored-by: Fu Chen <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
(cherry picked from commit 6f1bb41)
Signed-off-by: zky.zhoukeyong <[email protected]>
@cfmcgrady cfmcgrady deleted the allow-cache branch August 11, 2023 14:09
waitinfuture added a commit that referenced this pull request Aug 15, 2023
…lt to false

### What changes were proposed in this pull request?
As title

### Why are the changes needed?
I tested 1.1T and 3.3T shuffle, as well as 3T TPCDS, with the thread cache on and off in the shared PooledByteBufAllocator and found no difference:

| Benchmark | Cache On | Cache Off |
| --- | --- | --- |
| 1.1T Shuffle | 3.7min/1.9min | 3.7min/1.9min |
| 3.3T Shuffle | 12min/6.7min | 12min/6.2min |
| 3T TPCDS | 2645s | 2644s |

And since the configuration has a big influence on direct memory usage (see #1716), it is necessary to set the default value to false.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1817 from waitinfuture/897.

Authored-by: zky.zhoukeyong <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
waitinfuture added a commit that referenced this pull request Aug 15, 2023
…lt to false

(cherry picked from commit 57fdbf0)
Signed-off-by: zky.zhoukeyong <[email protected]>